Welcome to this workshop on ANNs in R

Assumed Workshop Prerequisites are:

  • Comfortable using and navigating the RStudio IDE for R development
  • Experience with the base and Tidyverse dialects of R
  • R Markdown and basic data science skills
  • This workshop is introductory, so no prior knowledge of neural networks is required

General Workshop Info

  • Not all code details are important, so do not worry if not everything makes sense
  • All materials are and will remain open source under GPL-3.0 on GitHub, so you can revisit the entire workshop any time you like!
  • Per the previous point, please do not record this workshop or take screenshots

Workshop Learning Objectives

A participant who has met the objectives of this workshop will be able to:

  • Conceptually describe
    • What an ANN is
    • How an ANN is trained
    • How predictions are made
    • What ANN hyperparameters are
  • Create a simple dense ANN model in R using TensorFlow via Keras
  • Apply the created model to predict on data

Workshop Limitations

  • The aim of this workshop is to introduce you to artificial neural networks in R

  • The key word here is introduce: we have limited time, so you will mainly be working with code I created

  • We have 4 hours in total, so realistically we can only scratch the surface of neural networks/deep learning

  • If you have little to no experience with base R and/or the Tidyverse, expect the workshop to feel overwhelming

  • Workshop materials will remain open, so my intention is that you can revisit and study them further after today’s workshop

Your host for the day will be… Me!

On Artificial Neural Networks (ANNs)

What are ANNs?

  • A mathematical framework inspired by the network structure of neurons in the human brain

Source: Bruce Blaus | Multipolar Neuron | CC BY 3.0

  • In reality, we do not really know how the human brain learns; we only know that it is capable of processing “data”

“Inspired by neuron network structure”

  • If you google “Artificial Neural Networks”, you will get something like this:

  • Let’s demystify this…

“Inspired by neuron network structure”

  • \(I_1 \ldots I_n\): Input layer variables/features, \(B_I\): Bias/intercept
  • \(H_1 \ldots H_m\): Hidden layer, \(B_H\): Bias/intercept
  • \(O\): Output layer, response/outcome

Making a Prediction: The feed forward algorithm

Example: Fully Connected Neural Network

  • To put it simply, the input vector \(\textbf{I}\) is transformed into a prediction \(O\)

  • The input vector is simply the set of variables in your data for a single observation, e.g.

## # A tibble: 10 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
##  1          5.5         2.4          3.7         1   versicolor
##  2          5.7         2.9          4.2         1.3 versicolor
##  3          5.5         2.5          4           1.3 versicolor
##  4          6.1         3            4.6         1.4 versicolor
##  5          6.4         2.8          5.6         2.2 virginica 
##  6          5.2         3.4          1.4         0.2 setosa    
##  7          5.1         3.8          1.6         0.2 setosa    
##  8          5.7         3            4.2         1.2 versicolor
##  9          6           2.2          5           1.5 virginica 
## 10          7.2         3            5.8         1.6 virginica
  • \(Species \sim f(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)\)

  • We can visualise this like so
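Before the diagram, note that the input vector for a single observation can be pulled directly out of R’s built-in iris data. A minimal base R sketch:

```r
# Extract the input vector I for the first observation of the
# built-in iris data set
I <- as.numeric(unlist(iris[1, c("Sepal.Length", "Sepal.Width",
                                 "Petal.Length", "Petal.Width")]))
I
#> [1] 5.1 3.5 1.4 0.2
```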

Example: Fully Connected Neural Network

  • I is the input vector, H is the hidden layer and O is the output
  • B is the bias neuron, think intercept in the familiar \(y = b + a \cdot x\)

Example: Fully Connected Neural Network

  • Flow from input layer (features) to hidden layer:

    \(H_{j} = I_{1} \cdot v_{1,j} + I_{2} \cdot v_{2,j} + \dots + I_{n} \cdot v_{n,j} + B_{I} \cdot v_{n+1,j} =\) \(\sum_{i=1}^{n} I_{i} \cdot v_{i,j} + B_{I} \cdot v_{n+1,j} = \sum_{i=1}^{n+1} I_{i} \cdot v_{i,j} = \textbf{I} \cdot \textbf{v}_j\), where the bias is folded in as \(I_{n+1} = B_{I}\)

  • Non-linear transformation of hidden layer input to hidden layer output (activation function):

    \(S(H_{j}) = \frac{1}{1+e^{-H_{j}}}\)
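The two steps above (linear combination, then sigmoid) can be sketched in base R; the weight values in `v` are random example numbers, not trained ones:

```r
# Sketch of the input-to-hidden flow for n = 4 features, m = 3 hidden neurons
set.seed(42)
I   <- c(5.1, 3.5, 1.4, 0.2)                 # input vector (one iris observation)
B_I <- 1                                     # bias neuron always emits 1
v   <- matrix(runif(5 * 3, -1, 1), nrow = 5) # weights, incl. a row for the bias
H_in  <- as.numeric(c(I, B_I) %*% v)         # linear combination I . v_j per hidden neuron
S     <- function(x) 1 / (1 + exp(-x))       # sigmoid activation
H_out <- S(H_in)                             # hidden layer outputs, each in (0, 1)
```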

Example: Fully Connected Neural Network

  • Flow from hidden layer to output layer:

    \(O = H_{1} \cdot w_{1} + H_{2} \cdot w_{2} + \dots + H_{m} \cdot w_{m} + B_{H} \cdot w_{m+1} =\) \(\sum_{j=1}^{m} H_{j} \cdot w_{j} + B_{H} \cdot w_{m+1} = \sum_{j=1}^{m+1} H_{j} \cdot w_{j} = \textbf{H} \cdot \textbf{w}\), where the bias is folded in as \(H_{m+1} = B_{H}\)

  • Non-linear transformation of output layer input to output layer output (activation function):

    \(S(O) = \frac{1}{1+e^{-O}}\)
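Again as a base R sketch, with made-up numbers standing in for the hidden layer outputs \(\textbf{H}\) and the weights \(\textbf{w}\):

```r
# Sketch of the hidden-to-output flow for m = 3 hidden neurons
H   <- c(0.7, 0.2, 0.9)                # example hidden layer outputs
B_H <- 1                               # bias neuron
w   <- c(0.5, -0.3, 0.8, 0.1)          # example weights, incl. bias weight w_{m+1}
O_in  <- sum(c(H, B_H) * w)            # linear combination H . w
S     <- function(x) 1 / (1 + exp(-x)) # sigmoid activation
O_out <- S(O_in)                       # final prediction, in (0, 1)
```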

Training a Network: The Back Propagation Algorithm

Example: Fully Connected Neural Network

  • Activation function (non-linearity):
    • \(s(x) = \frac{1}{1+e^{-x}}\)
  • Loss function (error):
    • \(E = MSE(O,t) = \frac{1}{2} \left( O - t \right)^2\)
  • Optimisation using gradient descent (weight updates):
    • \(\Delta w = - \epsilon \frac{\partial E}{\partial w} \Leftrightarrow w_{new} = w_{old} - \epsilon \frac{\partial E}{\partial w}\)
    • \(\Delta v = - \epsilon \frac{\partial E}{\partial v} \Leftrightarrow v_{new} = v_{old} - \epsilon \frac{\partial E}{\partial v}\)
    • Where \(\epsilon\) = learning rate
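A single gradient-descent step can be traced numerically in base R. This is a toy sketch: the model is just one weight feeding a sigmoid output neuron, and the input `x`, target `t`, starting weight `w` and learning rate are all made-up example values:

```r
# One gradient-descent step for a single weight, using the loss and
# update rule above
S <- function(x) 1 / (1 + exp(-x))
x <- 2; t <- 1                       # one training example: input and target
w <- 0.1                             # initial weight
eps <- 0.5                           # learning rate (epsilon)
o  <- S(w * x)                       # forward pass: prediction
E  <- 0.5 * (o - t)^2                # loss before the update
dE_dw <- (o - t) * o * (1 - o) * x   # chain rule: dE/do * do/d(wx) * d(wx)/dw
w_new <- w - eps * dE_dw             # w_new = w_old - eps * dE/dw
E_new <- 0.5 * (S(w_new * x) - t)^2  # loss after the update (smaller than E)
```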

Activation Function Examples

Activation Function - Sigmoid

  • Low input and the neuron is turned off (emits 0)
  • Medium input and the neuron emits a number between 0 and 1
  • High input and the neuron is turned on (emits 1)

Activation Function - Rectified Linear Unit

  • Input less than zero and the neuron is turned off (emits 0)
  • Input larger than zero and the neuron simply propagates the signal (emits x)

Activation Function - Leaky Rectified Linear Unit

  • Input less than zero and the neuron is almost turned off (emits a small number)
  • Input larger than zero and the neuron simply propagates the signal (emits x)
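All three activation functions are one-liners in base R. A sketch, using 0.01 as a typical small leak slope for the Leaky ReLU (other values are common too):

```r
# The three activation functions, evaluated at low, zero and high input
sigmoid    <- function(x) 1 / (1 + exp(-x))
relu       <- function(x) pmax(0, x)
leaky_relu <- function(x, a = 0.01) ifelse(x > 0, x, a * x)  # a = leak slope
x <- c(-5, 0, 5)
sigmoid(x)     # ~0 at low input, 0.5 at zero, ~1 at high input
relu(x)        # 0, 0, 5
leaky_relu(x)  # -0.05, 0, 5
```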

Activation Function - Output neuron(s)

  • Choice of activation function for the output neuron(s) depends on the aim

    • Binary Classification: Sigmoid
    • Multiclass Classification: Softmax, softmax\((x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\)
    • Regression: Linear
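A minimal softmax in base R, applied to some made-up raw output-layer scores: exponentiate, then normalise so the values sum to 1 and can be read as class probabilities.

```r
# Softmax: turn raw scores into class probabilities
softmax <- function(x) exp(x) / sum(exp(x))
z <- c(2.0, 1.0, 0.1)  # example raw scores for 3 classes
p <- softmax(z)        # probabilities, largest for the largest score
sum(p)                 # 1
```

In practice, implementations subtract `max(x)` before exponentiating to avoid numerical overflow; the result is identical.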

Optimiser: Stochastic Gradient Descent

  • We need to find the weight values resulting in the smallest possible loss
  • The optimisation cannot be solved analytically, so numerical approximation is used
  • E.g. SGD back-propagates the loss per single observation, which introduces fluctuations in the weight updates
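A toy SGD run in base R: fitting the single weight of a linear neuron to simulated \(y = 2x\) data, one shuffled observation at a time. All names and numbers are illustrative only:

```r
# Toy stochastic gradient descent on simulated y = 2x + noise data
set.seed(1)
x <- runif(50)
y <- 2 * x + rnorm(50, sd = 0.1)
w <- 0                                 # starting weight
eps <- 0.1                             # learning rate
for (epoch in 1:20) {                  # one epoch = one pass over all examples
  for (i in sample(seq_along(x))) {    # shuffled single observations
    o <- w * x[i]                      # prediction (linear neuron, no bias)
    w <- w - eps * (o - y[i]) * x[i]   # gradient of 1/2 (o - y)^2 w.r.t. w
  }
}
w  # close to the true slope 2, with small SGD fluctuations
```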

Optimiser: Stochastic Gradient Descent

Summary

Key Terms and Concepts

  • Input layer: The first layer of neurons being fed the examples from the feature matrix
  • Hidden layer(s): The layers connecting the visible input and output layers
  • Output layer: The layer creating the final output (prediction)
  • Feed forward algorithm: The algorithm used to make a prediction, where information flows from the input via the hidden to the output layer
  • Activation function: The function used to make a non-linear transformation of the set of linear combinations feeding into a neuron
  • Back propagation algorithm: The algorithm used for iteratively training the ANN
  • Loss/error function: The function used to measure the error between the true and the predicted value, when training the ANN
  • Optimiser: The function used for optimising the weights, when training the ANN
  • Epoch: One run through all training examples
  • An ANN can do binary and multiclass classification, as well as regression

Important

  • GitHub preview does not work on slides

  • All slides are available in the Talks-dir in your RStudio session Files pane; left-click on the .html file and choose ‘View in Web Browser’

  • The exercises are in the Exercises-dir; left-click on the 00_exercises.html file and choose ‘View in Web Browser’

  • You can also click directly on the exercise link in the workshop schedule

Time for exercises!

  • First, go to https://github.com/leonjessen/workshop_anns_in_r to follow the schedule

  • Then, please proceed to the exercise on prototyping an ANN

  • For the exercises, I strongly advise you to pair up two-and-two, so you can discuss

  • Internalising knowledge is much more efficient when you are forced to put concepts into words